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Abstract 

We propose several new schedules for Strassen-Winograd's matrix mul- 
tiplication algorithm, they reduce the extra memory allocation require- 
ments by three different means: by introducing a few pre-additions, by 
overwriting the input matrices, or by using a first recursive level of clas- 
sical multiplication. In particular, we show two fully in-place schedules: 
one having the same number of operations, if the input matrices can be 
overwritten; the other one, slightly increasing the constant of the leading 
term of the complexity, if the input matrices are read-only. Many of these 
schedules have been found by an implementation of an exhaustive search 
algorithm based on a pebble game. 

Keywords: Matrix multiplication, Strassen-Winograd's algorithm, Memory 
placement. 

1 Introduction 

Strassen's algorithm [16] was the first sub-cubic algorithm for matrix multipli- 
cation. Its improvement by Winograd [17] led to a highly practical algorithm. 
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The best asymptotic complexity for this computation has been successively im- 
proved since then, down to O (n 2 376 ) in [5] (see [3, 4] for a review), but Strassen- 
Winograd's still remains one of the most practicable. Former studies on how 
to turn this algorithm into practice can be found in [2, 9, 10, 6] and references 
therein for numerical computation and in [15, 7] for computations over a finite 
field. 

In this paper, we propose new schedules of the algorithm, that reduce the extra 
memory allocation, by three different means: by introducing a few pre-additions, 
by overwriting the input matrices, or by using a first recursive level of classical 
multiplication. These schedules can prove useful for instance for memory effi- 
cient computations of the rank, determinant, nullspace basis, system resolution, 
matrix inversion... Indeed, the matrix multiplication based LQUP factorization 
of [11] can be computed with no other temporary allocations than the ones 
involved in its block matrix multiplications [12]. Therefore the improvements 
on the memory requirements of the matrix multiplication, used together for in- 
stance with cache optimization strategies [1], will directly improve these higher 
level computations. 

We only consider here the computational complexity and space complexity, 
counting the number of arithmetic operations and memory allocations. The 
focus here is neither on stability issues, nor really on speed improvements. We 
rather study potential memory space savings. Further studies have thus to be 
made to assess for some gains for in-core computations or to use these sched- 
ules for numerical computations. They are nonetheless already useful for exact 
computations, for instance on integer/rational or finite field applications [8, 14]. 

The remainder of this paper is organized as follows: we review Strassen- 
Winograd's algorithm and existing memory schedules in sections 2 and 3. We 
then present in section 4 the dynamic program we used to search for schedules. 
This allows us to give several schedules overwriting their inputs in section 5, and 
then a new schedule for C *— AB + C using only two extra temporaries in section 
6, all of them preserving the leading term of the arithmetic complexity. Finally, 
in section 7, we present a generic way of transforming non in-place matrix 
multiplication algorithms into in-place ones (i.e. without any extra temporary 
space), with a small constant factor overhead. Then we recapitulate in table 10 
the different available schedules and give their respective features. 

2 Strassen-Winograd Algorithm 

We first review Strassen-Winograd's algorithm, and setup the notations that 
will be used throughout the paper. 

Let m, n and k be powers of 2. Let A and B be two matrices of dimension m x k 
and k x n and let C = A x B. Consider the natural block decomposition: 
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where An and Bn respectively have dimensions m/2 x k/2 and k/2 x n/2. 
Winograd's algorithm computes the to x n matrix C = Ax B with the following 
22 block operations: 
• 8 additions: 



Si < — A21 + A22 5*2 
Ti <— B12 — B11 T2 
S4 < — A12 — S2 



Si — An S3 
B22 — Ti T 3 



An -A21 
B22 — B12 
T 2 — B 2 i 



7 recursive multiplications: 



Pi «- 
^3 «- 
P5 «- 

7 final additions: 



An x Bn 

S4 X B22 

Si x Ti 



P2 
Pa 
Pe 



A12 x B 2 l 
A22 x Ti 
S 2 x T 2 



P 7 ^S 3 x T 3 



Ui + 


-Pl + P2 


u 2 


I-Pl 


+ p 6 


u 3 ^ 


-U2 + P7 


Ui 




+ P5 


u 5 + 


-u 4 + p 3 


u 6 


^u 3 


-Pi U 7 ^U 3 + P 5 



• The result is the matrix: C — 



Ui U 5 
U e U 7 

Figure 1 illustrates the dependencies between these tasks. 



3 Existing memory placements 

Unlike the classic multiplication algorithm, Winograd's algorithm requires some 
extra temporary memory allocations to perform its 22 block operations. 

3.1 Standard product 

We first consider the basic operation C <— A x B. The best known schedule 
for this case was given by [6]. We reproduce a similar schedule in table 1. It 
requires two temporary blocks X and Y whose dimensions are respectively equal 
to m/2 x max(fc/2,n/2) and k/2 x n/2. Thus the extra memory used is: 

to (k n\ kn (m k n 

Ei(m,k,n) = -m ax ^-,-j+--+E 1 (^-,-,- 

Summing these temporary allocations over every recursive levels leads to a total 
amount of memory, where for brevity M = min {to, k, n}: 

log 2 (M) 

Ei(m,k,n)= ^ — (mmax(fc,n) + kn) (1) 

i=l 

= l i 1 ~ w) ( mmax ( fc ' n ) + fcn ) 

< i (to max (k,n) + kn) . 
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U7 




U5 




U6 



Figure 1: Winograd's task dependency graph 



We can prove in the same manner the following lemma: 

Lemma 1. Let m, k and n be powers of two, g{x,y 1 z) be homogeneous, M 
min{m, fc,n} and f(m, k,n) be a function such that 



/(ra, k,n) 



_M?'Lf)+/(f'L§) if™,nandk>l 



otherwise. 



Then f (m, fc, n) = | (l — -p^) g(m, k, n) < ^g(m, k, n). 

In the remainder of the paper, we use Ei to denote the amount of extra 
memory used in table number i. The amount of extra memory we consider is 
always the sum up to the last recursion level. 

Finally, assuming m — n = k gives a total extra memory requirement of 
£i(n,n,ra) < 2/3n 2 . 



3.2 Product with accumulation 

For the more general operation C <— aA x B + /3C, a first naive method would 
compute the product a A x B using the scheduling of table 1, into a temporary 
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# 


operation 


loc. 




operation 


loc. 


1 


s 3 


= An - A 2 i 


X 


12 


Pi = AnPn 


X 


2 


T 3 


= P22 — B12 


Y 


13 


U 2 = Pi + P 6 


C12 


3 


Pr 


= S 3 T 3 


C21 


14 


U 3 = U 2 + P 7 


C*21 


4 


Si 


= A 2 i + A 22 


X 


15 


U 4 = U 2 + P 5 




5 


Ti 


= B12 — Bn 


Y 


16 


U 7 = U 3 + P 5 


C*22 


6 


P 5 


= SiT! 


C22 


17 


U 5 = Ut + P 3 


C12 


7 


s 2 


= Si - A - n 


X 


18 


T4 — T2 — B21 


y 


8 


T 2 


= B22 — Ti 


Y 


19 


P 4 = A22T4 




9 


P 6 


= S2T2 


C12 


20 


U 6 = U 3 - Pa 


C*21 


10 


S 4 


= A\2 — S2 


X 


21 


P2 = A12P21 


On 


11 


Ps 


— S4B22 


Cn 


22 


Ui = Pi + P 2 





Table 1: Winograd's algorithm for operation C <— Ax _B, with two temporaries 



matrix C and finally compute C <— C +fiC. It would require (l + 2/3)n 2 extra 
memory allocations in the square case. 

Now the schedule of table 2 due to [10, fig. 6] only requires 3 temporary blocks 
for the same number of operations (7 multiplications and 4+15 additions). The 



# 


operation 


loc. 


# 


operation 


loc. 


1 


Si = A21 + A22 




12 


S4 = A12 — S2 


X 


2 


Ti = P12 — B11 


r 


13 


T 4 = T2 — B21 


Y 


3 


Ps = aSiTi 


z 


14 


C12 = aS 4 P22 + C12 


C12 


4 


C22 = Ps + PC22 


C22 


15 


U 5 = U2 + C12 


C12 


5 


C12 = Ps + /3Ci2 


C\2 


16 


P 4 = QA22T4 - PC21 


C21 


6 


S2 = Si — An 


X 


17 


S3 = An - A21 


X 


7 


T2 = B22 — Ti 


Y 


18 


T3 = B22 — P12 


Y 


8 


Pi = aAuBu 


Z 


19 


Us = aS 3 T 3 + U 2 


Z 


9 


C11 = Pi + PC11 


C11 


20 


Ur = U 3 + C22 


C22 


10 


U 2 = aS 2 T 2 + Pi 


z 


21 


U e = U 3 - C21 


C21 


11 


Ui = aiA 12 B 2 i + C11 C„ 


22 







Table 2: Schedule for operation C <— a A x B + (3C with 3 temporaries 



required three temporary blocks X, Y, Z have dimensions m/2 x fc/2, fc/2 x n/2 
and m/2 x n/2. Since the two temporary blocks in schedule 1 are smaller than 
the three ones here, we have E2 ^ E\. Hence, using lemma 1, we get 

E 2 (to, k, n) = ^ ^1 - -j^J + kn + mn) . (2) 

With m = n = k, this gives ^(n, n, n) < n 2 . 

We propose in table 9 a new schedule for the same operation aA x B + f3C only 
requiring two temporary blocks. 

Our new schedule is more efficient if some inner calls overwrite their temporary 
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input matrices. We now present some overwriting schedules and the dynamic 
program we used to find them. 

4 Exhaustive search algorithm 

We used a brute force search algorithm 1 to get some of the new schedules that 
will be presented in the following sections. It is very similar to the pebble game 
of Huss-Lederman et al. [10]. 

A sequence of computations is represented as a directed graph, just like figure 1 
is built from Winograd's algorithm. 

A node represents a program variable. The nodes can be classified as initials 
(when they correspond to inputs), temporaries (for intermediate computations) 
or finals (results or nodes that we want to keep, such as ready-only inputs). 
The edges represent the operations; they point from the operands to the result. 
A pebble represents an allocated memory. We can put pebbles on any nodes, 
move or remove them according to a set of simple rules shown below. 
When a pebble arrives to a node, the computation at the associated variable 
starts, and can be "partially" or "fully" executed. If not specified, it is assumed 
that the computation is fully executed. 

Edges can be removed, when the corresponding operation has been computed. 
The last two points are especially useful for accumulation operations: for exam- 
ple, it is possible to try schedule the multiplication separately from the addition 
in an otherwise recursive AB + C call; the edges involved in the multiplication 
operation would then be removed first and the accumulated part later. They 
are also useful if we do not want to fix the way some additions are performed: 
if Us = Pi + Pa + Pj the associativity allows different ways of computing the 
sum and we let the program explore these possibilities. At the beginning of 
the exploration, each initial node has a pebble and we may have a few extra 
available pebbles. The program then tries to apply the following rules, in order, 
on each node. The program stops when every final node has a pebble or when 
no further moves of pebbles are possible: 

• Rule 0. Computing a result/removing edges. If a node has a pebble and 
parents with pebbles, then the operation can be performed and the correspond- 
ing edges removed. The node is then at least partially computed. 

• Rule 1. Freeing some memory /removing a pebble. If a node is isolated 
and not final, its pebble is freed. This means that we can reclaim the memory 
here because this node has been fully computed (no edge pointing to it) and is 
no longer in use as an operand (no edge initiating from it). 

• Rule 2. Computing in place/moving a pebble. If a node P has a full 
pebble and a single empty child node S and if other parents of S have pebbles 
on them, then the pebble on P may be transferred to S (corresponding edges 
are removed). This means an operation has been made in place in the parent 
P's pebble. 

lr The code is available at http://ljk.imag.fr/CASYS/LOGICIELS/Galet. 
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• Rule 3. Using more memory/adding a pebble. If parents of an empty node 
N have pebbles and a free pebble is available, then this pebble can be assigned 
to TV and the corresponding edges are removed. This means that the operation 
is computed in a new memory location. 

• Rule 4. Copying some memory /duplicating a pebble. A computed node 
having a pebble can be duplicated. The edges pointed to or from the original 
node are then rearranged between them. This means that a temporary result 
has been copied into some free place to allow more flexibility. 

5 Overwriting input matrices 

We now relax some constraints on the previous problem: the input matrices 
A and B can be overwritten, as proposed by [13]. For the sake of simplicity, 
we first give schedules only working for square matrices (i.e. m = n = k and 
any memory location is supposed to be able to receive any result of any size). 
We nevertheless give the memory requirements of each schedule as a function 
of m; k and n. Therefore it is easier in the last part of this section to adapt the 
proposed schedules partially for the general case. In the tables, the notation 
AijBij (resp. AijBij + C%j) denotes the use of the algorithm from table 1 (resp. 
table 2) as a subroutine. Otherwise we use the notation Alg(AijBij) to denote 
a recursive call or the use of one of our new schedules as a subroutine. 

5.1 Standard product 

We propose in table 3 a new schedule that computes the product C <— A x B 
without any temporary memory allocation. The idea here is to find an order- 
ing where the recursive calls can be made also in place such that the operands 
of a multiplication are no longer in use after the multiplication has completed 
because they are overwritten. An exhaustive search showed that no schedule 
exists overwriting less than four sub-blocks. Note that this schedule uses only 



# 


operation 


loc. 


# 


operation 


loc. 


1 


s 3 


= An - A 2 i 


Cn 


12 


S4 = Ai 2 — S 2 


A22 


2 


Si 


= A 2 i + A 22 


A 2 i 


13 


P 6 = IP(S 2 T 2 ) 


C22 


3 


Ti 


— Bi 2 — Bn 


C22 


14 


U 2 =Pi + P 6 


C22 


4 


Ts 


= B 22 — B12 


B12 


15 


P 2 = IP(Ai 2 B 2 i) 


C12 


5 


Pi 


= IP(SsT 3 ) 


C21 


16 


Ui = Pi + P 2 


Cn 


6 


s 2 


= Si- An 


C12 


17 


Ua = U 2 + P 5 


C12 


7 


Pi 


= iP(AiiBu) 


Cn 


18 


U 3 = U 2 + P 7 


C22 


8 


T 2 


= B 22 - Ti 


Bn 


19 


U 6 = Us - Pi 


C21 


9 


ft 


= IP(SiTi) 


An 


20 


U 7 = U 3 + P 5 


C22 


10 




— T 2 — B 2 i 


C22 


21 


Ps = IP(SiB 22 ) 


A 12 


11 


Pi 


= IP (A 22 T 4 ) 


A21 


22 


U 5 = Ui + Ps 


C12 



Table 3: IP schedule for operation C <— A x B in place 



7 



two blocks of B and the whole of A but overwrites all of A and B. For instance 
the recursive computation of Pi requires overwriting parts of A\i and B21 too. 
Using another schedule as well as back-ups of overwritten parts into some avail- 
able memory In the following, we will denote by IP for InPlace, either one of 
these two schedules. 

We present in tables 4 and 5 two new schedules overwriting only one of the two 
input matrices, but requiring an extra temporary space. These two schedules 
are denoted OvL and OvR. The exhaustive search also showed that no schedule 
exists overwriting only one of A and B and using no extra temporary. We note 



# 


operation 


loc. 


# 


operation 


loc. 


1 


S3 


= Aii - A21 


C22 


12 


P 6 = 0vL(&T 2 ) 


C21 


2 


Si 


= A 2 i + A22 


A21 


13 


T4 = T2 — B21 


All 


3 


s 2 


= Si- An 


C\2 


14 


U 2 =Pi + P 6 


C21 


4 


Ti 


= B12 — Bn 


C21 


15 


U 4 = U 2 + P 5 


C12 


5 


Pi 


= OvL(AiiSn) 


Cn 


16 


U 3 = U 2 + Pi 


C21 


6 


n 


= B22 - Bl2 


An 


17 


U 7 = Us + Ps 


C22 


7 


Pi 


= IPOS3T3) 


X 


18 


U 5 = U 4 + Ps 


C12 


8 


T 2 


= B22 — Ti 


An 


19 


P 2 = 0vL(A 12 B 2 i) 


X 


9 


Ps 


= IP(SiTi) 


C22 


20 


Ui = Pi + P 2 


Cn 


10 


s 4 


= A12 — S2 


C21 


21 


Pi = IP(A 22 T 4 ) 


A21 


11 


Ps 


= 0vL(S , 4 -B 22 ) 


A21 


22 


U 6 = U z - Pi 


C21 



Table 4: OvL schedule for operation C <— A x B using strictly two blocks of A 
and one temporary 





operation 


loc. 


# 


operation 


loc. 


1 


S3 


= An - A21 


C22 


12 


Pi = 0vR(A 22 T 4 ) 


B12 


2 


Si 


= A21 + A22 


C21 


13 


5*4 = Ai 2 — S 2 


Bn 


3 


Ti 


= B12 — Bn 


C12 


14 


U 2 =Pi + P 6 


C21 


4 


Pi 


= OvR(AiiBu) 


C11 


15 


Ui = U 2 + P 5 


C12 


5 


S* 2 


= Si- An 


B11 


16 


U3 = u 2 + P 7 


C21 


6 


T 3 


= B22 — B12 


B12 


17 


U 7 = Us + Ps 


C22 


7 


Pi 


= IP(S 3 T 3 ) 


X 


18 


U 6 = U 3 - Pi 


C21 


8 


T 2 


= P 22 — Ti 


B\2 


19 


Ps = IP(S 4 P 22 ) 


B\2 


9 


Ps 


= IP(SiTi) 


C22 


20 


U 5 = Ui + Ps 


C\2 


10 


Ti 


= T 2 — P21 


C\2 


21 


P 2 = 0vR(Ai 2 B 2 i) 


B\2 


11 


Ps 


= 0vR(S , 2 T 2 ) 


C21 


22 


Ui = Pi + P 2 


Cn 



Table 5: OvR schedule for operation C <— A x B using strictly two blocks of B 
and one temporary 

that we can overwrite only two blocks of A in OvL when the schedule is modified 
as follows: 
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# 


operation 


loc. 


18bis 


A 2 i = Copy(Ai 2 ) 


A21 


19bis 


A 12 = Copy(A 2 i) 


A 12 


21 


Pa = 0vR(A 22 T 4 ) 


A21 



Similarly, for OvR, we can overwrite only two blocks of B using copies on lines 
20 and 21 and OvL on line 19. 

We now compute the extra memory needed for the schedule of table 5. The size 
of the temporary block X is (J) , the extra memory required for table 5 hence 
satisfies: -£5(71,71,71) < \n 2 . 

5.2 Product with accumulation 

We now consider the operation C <— otA x B + (3C, where the input matrices A 
and B can be overwritten. We propose in table 6 a schedule that only requires 
2 temporary block matrices, instead of the 3 in table 2. This is achieved by 
overwriting the inputs and by using two additional pre-additions {Z\ and Z2) 
on the matrix C . We also propose in table 7 a schedule similar to table 6 



# 


operation 


loc. 


# 


operation 




loc. 


1 


Zi — C22 — C12 


C22 


13 


P 4 = kcUt{aA 2 2T 4 - 


-pz 2 ) 


C 2 i 


2 


Si — A21 + A22 


X 


14 


S4 — A12 — S2 




A 22 


3 


Ti — B12 — Bn 


Y 


15 


P 6 = aI?(S 2 T 2 ) 




X 


4 


Z2 — C21 — Zi 


C21 


16 


P 2 = AcLR(aAi2-B 2 l 


+0Cu) 


Cn 


5 


T3 — B22 — B12 


B12 


17 


Ui = Pi + P 2 




Cn 


6 


S3 = An - A 2 i 


A21 


18 


U 2 = Pi + P e 




X 


5 


P 7 = kcLK(aS 3 T 3 +0Zi) 


C22 


17 


U 3 = U 2 + Pv 




C22 


8 


S 2 = Si- An 


A21 


20 


Ui = U 2 + P5 




X 


9 


T 2 — B22 — Ti 


B12 


21 


U 6 = U 3 - Pi 




C21 


10 


P 5 = AcLR(aSiTi+/3Ci 2 ) 


C12 


22 


U T = U 3 + P 5 




C22 


11 


Pi = alP(AnBn) 


Y 


23 


P 3 = aIP(S 4 B 2 2) 




C12 


12 


T4 — T2 — B21 


X 


24 


U 5 = Ui + P 3 




C12 



Table 6: AcLR schedule for C <— aA x B + j3C overwriting A and B with 2 
temporaries, 4 recursive calls 



overwriting only for instance the right input matrix. It also uses only two 
temporaries, but has to call the OvR schedule. The extra memory required by 
X and Y in table 6 is 2 (^) . Hence, using lemma 1: 

2 

E 6 (n,n,n) < -n 2 . (3) 

The extra memory Ey(ti, n, n) required for table 7 in the top level of recursion 
is: 

fn\ 2 fn\ 2 . (n n n\ 

(2) +(2) +maX ^^l2'2'2)- 
We clearly have E-j > E5 and: 

2 

£7(71, n, n) < -n 2 . 
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# 


operation 


loc. 


# 


operation 


loc. 


1 


Zi — C22 — C12 


C22 


13 


P 2 = AccR(qAi 2 B2i4-/3Ch) 


Cn 


2 


Ti = B 12 — Bn 


X 


14 


S2 — Si — An 


y 


3 


Z2 — C21 — Z\ 


C21 


15 


P 6 = a0vR(S 2 T 2 ) 


B21 


4 


T3 — P22 — -B12 


P12 


16 


Si — A\2 — S2 


Y 


5 


S 3 = An - A 2 i 


Y 


17 


U 2 = Pi + Pe 


B21 


6 


P 7 = AccR(qS 3 T 3 +/3Zi) 


C22 


18 


U 3 = U 2 + Pi 


C22 


7 


Si = A21 + A22 


Y 


19 


U 4 = U 2 + P5 


B21 


8 


T2 — -B22 — T\ 


B\2 


20 


U 6 = U 3 - Pi 


C21 


9 


P 5 = AccR(aSiTi+/3C*i 2 ) 


C12 


21 


Ui = Pi + P 2 


Cn 


10 


T4 — T2 — B21 


X 


22 


U T = U 3 + P 5 


C22 


11 


Pi = AccR(aA 2 2T 4 -/3Z 2 ) 


C21 


23 


P 3 = aIP(S 4 B 2 2) 


C12 


12 


Pi = aDvR(AnBn) 




24 


U 5 = U 4 + P 3 


C12 



Table 7: AccR schedule for C <— aA x B + /3C overwriting B with 2 temporaries, 
4 recursive calls 



Compared with the schedule of table 2, the possibility to overwrite the input 
matrices makes it possible to have further in place calls and replace recursive 
calls with accumulation by calls without accumulation. We show in theorem 3 
that this enables us to almost compensate for the extra additions performed. 

5.3 The rectangular case 

We now examine the sizes of the temporary locations used, when the matrices 
involved do not have identical sizes. We want to make use of table 3 for the 
general case. 

Firstly, the sizes of A and B must not be bigger than that of C (i.e. we need 
k ^ min (m, n)). Indeed, let's play a pebble game that we start with pebbles on 
the inputs and 4 extra pebbles that are the size of a dj . No initial pebble can 
be moved since at least two edges initiate from the initial nodes. If the size of 
Aij is larger that the size of the free pebbles, then we cannot put a free pebble 
on the Si nodes (they are too large). We cannot put either a pebble on Pi or 
P2 since their operands would be overwritten. So the size of is smaller or 
equal than that of Cij . The same reasoning applies for Bij . 
Then, if we consider a pebble game that was successful, we can prove in the 
same fashion that either the size of A or the size of B can not be smaller that 
of C (so one of them has the same size as C) . 

Finally, table 3 shows that this is indeed possible, with k — n ^ m. It is also 
possible to switch the roles of m and n. 

Now in tables 4 to 7, we need that A, B and C have the same size. Generalizing 
table 3 whenever we do not have a dedicated in-place schedule can then done by 
cutting the larger matrices in squares of dimension min (m, k, n) and doing the 
multiplications / product with accumulations on these smaller matrices using 
algorithm 1 to 7 and free space from A, B or C. Since algorithms 1 to 7 require 
less than n 2 extra memory, we can use them as soon as one small matrix is free. 
We now propose an example in algorithm 1 for the case n < min (m, k) : 

Proposition 1. Algorithm 1 computes the product C = AB in place, overwrit- 
ing A and B. 
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Algorithm 1 IPOvMM: In-Place Overwrite Matrix Multiply 
Input: A and B of resp. sizes to x k and k x n 
Input: n < min (to, k) and m, fc, n powers of 2. 
Output: C = Ax B 

1: Let fco = k/n and mo = m/n. 





"^1,1 








"Si" 




"Ci' 


2: Split A = 












and C = 






ArriQ ,1 








Bfco. 




c fco 



where Ajj and 



have dimension n x n 
3: Ci <— Ai iBi > with alg. of table 1 and memory C*2. 

4: Now we use A\ t ± as temporary space. 

5: for i = 2 ... fco do 

6: Ci <— A lt \B\ > with alg. of table 4. 

7: end for 

8: for j = 2 ... fc do 

9: for i = l... to do 

10: Cj <— A it jBj + Cj > with alg. of table 2. 

11: end for 

12: end for 



Finally, we generalize the accumulation operation from table 7 to the rect- 
angular case. We can no longer use dedicated square algorithms. This is done 
in table 8, overwriting only one of the inputs and using only two temporaries, 
but with 5 recursive accumulation calls: 



# 


operation 


loc. 


# 


operation 


loc. 


1 


Z± — C22 — C12 


C22 


13 


P 2 = AcR(a J 4i 2 B2l+/3Cn 


) Cn 


2 


Ti — B12 — Bn 




14 


Ui = Pi + P 2 


Cn 


3 


Z2 — C21 — Zi 


C21 


15 


S 2 = Si - A11 


y 


4 


T3 — B22 — B\2 


-B12 


16 


U 2 = AcR(qS 2 T 2 + Pi) 


X 


5 


S3 = M\ - A 2 i 




17 


(7 3 = U 2 + P7 


C22 


6 


P 7 = AcR(aS 3 T 3 +/3Zi) 


C22 


18 


U 6 = U 3 - P 4 


C21 


7 


Si = A21 + A22 


y 


19 


U 7 = Us + Ps 


C22 


8 


T 2 = B22 - Ti 


B12 


20 


Ui = U 2 + Ps 


X 


9 


P 5 = AcR(aSiTi+/3Ci 2 ) 


C12 


21 


S4 — Ai 2 — S2 


Y 


10 


T4 — T 2 — B21 


X 


22 


Ps = aS 4 S 2 2 


C\2 


11 


Pi = kcR(aA2 2 Ti-f)Z2) 


C21 


23 


U 5 = t/ 4 + P 3 


C12 


12 


Pi = ctAiiBn 


X 


24 







Table 8: AcR schedule for C <— xB + (3C with 5 recursive calls, 2 temporaries 
and overwriting B 



For instance, in table 8, the last multiplication (line 22, P3 = aS^f^) could 
have been made by a call to the in place algorithm, would C12 be large enough. 
This is not always the case in a rectangular setting. 

Now, the size of the extra temporaries required in table 8 is max § )§ + 
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?r| and E$(m, k, n) is equal to: 

(m k\n mk , ( m k n\ 

maX { r 2 ) 2 + "2 2 + ^ {Ea > El) U ' 2 ' 2 ' ' 

If m < k < n or k < m < n, then E 8 (m, k, n) < £i(m, k, n): 





i n 


m 


fc 


l-^i 


/ m 


fc 




2/ 


)<T 


2 


2 " 


U' 


2 


V 




v n 


m 


k 




m n 




k n 


2y 


)<T 


2 


2 




2 2 


+ 


2 2 



< max 

Otherwise E 8 (m 7 k, n) > E\(m, k, n) and: 

E%(m, k,n) < ^ (max (m, fc)n + mfc) . 
In the square case, this simplifies into E%(n,n,n) ^ |n 2 . 

In addition, if the size of -B is bigger than that of A, then one can store S*2, for 
instance within B 12 , and separate the recursive call 16 into a multiplication and 
an addition, which reduces the arithmetic complexity. Otherwise, a scheduling 
with only 4 recursive calls exists too, but we need for instance to recompute S4 
at step 21. 



6 Hybrid scheduling 

By combining techniques from sections 3 and 5, we now propose in table 9 
a hybrid algorithm that performs the computation C <— aA x B + f3C with 
constant input matrices A and B, with a lower extra memory requirement than 
the scheduling of [10] (table 2). We have to pay a price of order n 2 log(n) extra 
operations, as we need to compute the temporary variable T2 twice. 



# 


operation 


loc. 


# 


operation 


loc. 


1 


Z\ — C22 — C12 


C22 


14 


P 2 = Acc(qAi 2 B21+/3Ch) Ch 


2 


Z3 — C12 — C21 


C12 


15 


Ui = Pi + P 2 


Cu 


3 


Si = A21 + A22 


X 


16 


U 5 = t/ 2 + P 3 


C12 


4 


Ti — B12 — -B11 


Y 


17 


S 3 = An - A 2 i 


X 


5 


P 5 = Acc(aSiTi+/3Z 3 ) 


C12 


18 


T3 — P22 — B\2 


Y 


6 


S 2 = Si - An 


X 


19 


U 3 = P7 + % 


C21 


7 


T2 — B22 — Ti 






= aAcLR(S 3 T 3 +;72) 




8 


Pe = Acc(aS 2 T 2 +/3C 2 i 


) C21 


20 


U T = U 3 + Wi 


C22 


9 


S4 — A12 — S2 


X 


21 


T[ = B12 - B11 


Y 


10 


Wi = P 5 + 0Zr 


C22 


22 


T 2 = S22 - Tj 


Y 


11 


P 3 = Acc(aS 4 B 2 2+P5) 


C12 


23 


T 4 = Ti, - B21 


Y 


12 


Pi = aAiiBn 


X 


24 


U 6 = U 3 - P4 


C21 


13 


U 2 = Pe + Pi 


C21 




= -akccR(A 2 2Ti-U 3 ) 





Table 9: Acc schedule for operation C <— aA x B + (3C with 2 temporaries 
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Again, the two temporary blocks X and Y have dimensions X s = Y s = 
(n/2) 2 so that: 

E 9 =Y S + max {X s + E 9 ,X S + E 6 , E s } 

In all cases, Eq + X s ^ E$. But X s + Y s is not as large as the size of the two 
temporaries in table 6. We therefore get: 

E 9 (m,k,n)=Y s +X s +E 6 [ - J 

< 2 © 2 4((f) S + (f)> 

Assuming m = n = k, one gets E 9 (n, n, n) < |n 2 , which is smaller than the 
extra memory requirement of table 2. 



m k n 
T' 2' 2 



7 A sub-cubic in-place algorithm 

Following the improvements of the previous section, the question was raised 
whether extra memory allocation was intrinsic to sub-cubic matrix multiplica- 
tion algorithms. More precisely, is there a matrix multiplication algorithm com- 
puting C <— A x B in O (n log2 7 ) arithmetic operations without extra memory 
allocation and without overwriting its input arguments? We show in this sec- 
tion that a combination of Winograd's algorithm and a classic block algorithm 
provides a positive answer. Furthermore this algorithm also improves the extra 
memory requirement for the product with accumulation C <— aA x B + f3C. 

7.1 The algorithm 

The key idea is to split the result matrix C into four quadrants of dimension 
n/2 x n/2. The first three quadrants Cn, C\i and C21 are computed using fast 
rectangular matrix multiplication, which accounts for 2fc/n standard Winograd 
multiplications on blocks of dimension n/2 x n/2. The temporary memory 
for these computations is stored in C*22- Lastly, the block C22 is computed 
recursively up to a base case, as shown on algorithm 2. This base case, when 
the matrix is too small to benefit from the fast routine, is then computed with 
the classical matrix multiplication. 

Theorem 1. The complexity of algorithm 2 is: 

G(n,n) = 7.2n los ^ - Yin 2 + 6.8n 

when k = n. 
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Algorithm 2 IPMM: In-Place Matrix Multiply 



Input: A and B, of dimensions resp. n x k and k x n with k, n powers of 2 
and k > n. 



Output: C = A x B 
1: Split C 
where 



C21C22 









i4 2 ,i 




^2,2fc/n 







£>1,2 


and -B = 










B2k/n,2 



each 



and C, 



have dimension 
n/2 x n/2. 
do 

Gn = Ai^Bi,! 

C12 = -4l,l-Bl,2 

C21 = ^2,1-Bii 
end do 

for i = 2 ... 2* do 

n 

Cn = Ai^Bi^i + C11 
C12 = Ai^Bi^ + C12 

C2I = ^2,4-8^1 + C2I 

end for 

C22 = ^2,* x i?*^ 



> with alg. of table 1 using C22 as temp, space 



> with alg. of table 2 using G22 as temporary space: 



> recursively using IPMM. 



Proof. Recall that the cost of Winograd's algorithm for square matrices is 
W(n) = 6n log 2 7 - 5n 2 for the operation C <— AxB and W acc (n) = 6n log2 7 -4n 2 
for the operation C <— A x B + C. The cost G(n, k) of algorithm 2 is given by 
the relation 

G(n, k) = 3W(n/2) + 3(2fc/n - l)W acc (n/2) + G(n/2, k), 

the base case being a classical dot product: G(l, k) = 2k — 1. Thus, G(n, fc) = 
T^fen 106 ^ 7 )- 1 - Ylkn- n 2 + 34fc/5. □ 

Theorem 2. For any to, n and k, algorithm 2 is in place. 

Proof. W.l.o.g, we assume that to ^ n > 1 (otherwise we could use the trans- 
pose) . The exact amount of extra memory from algorithms in table 1 and 2 is 
respectively given by eq. (1) and (2). 

If we cut B into pi stripes at recursion level i, then the sizes for the involved 
submatrices of A (resp. B) are m/2 l x k/pt (reps. k/p% x n/2 1 ). The lower right 
corner submatrix of C that we would like to use as temporary space has a size 
to/2 4 x n/2 1 . Thus we need to ensure that the following inequality holds: 

(m k n\ m n 
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It is clear that E\ < E 2 , which simplifies the previous inequality. Let us now 
write K — k/pi, M = to/2 1 and N = n/2 1 . We need to find, for every i an 
integer pi > 1 so that eq. (4) holds. In other words, let us show that there exists 
some K < k such that, for any (M, N), the inequality E 2 (M,K,N) < MN 
holds. Then the fact that E(M, 2, N) < \{2M + 2N + MN) \{AM + MN) < 
MN provides at least one such K . 

As the requirements in algorithm 2 ensure that k > N and M = N, there just 
remains to prove that E(M, N, N) s^MN. Since E(M,N,N) < \{2MN + N 2 ) 
and again M > N, algorithm 2 is indeed in place. □ 

Hence a fully in-place O (n log2 7 ) algorithm is obtained for matrix multipli- 
cation. The overhead of this approach appears in the multiplicative constant of 
the leading term of the complexity, growing from 6 to 7.2. 
This approach extends to the case of matrices with general dimensions, using 
for instance peeling or padding techniques. 

It is also useful if any sub-cubic algorithm is used instead of Winograd's. For in- 
stance, in the square case, one can use the product with accumulation in table 9 
instead of table 2. 

7.2 Reduced memory usage for the product with accumu- 
lation 

In the case of computing the product with accumulation, the matrix C can no 
longer be used as temporary storage, and extra memory allocation cannot be 
avoided. Again we can use the idea of the classical block matrix multiplication 
at the higher level and call Winograd algorithm for the block multiplications. 
As in the previous subsection, C can be divided into four blocks and then the 
product can be made with 8 calls to Winograd algorithm for the smaller blocks, 
with only one extra temporary block of dimension n/2 x n/2. 
More generally, for square n x n matrices, C can be divided in t 2 blocks of 
dimension j x j. Then one can compute each block with Winograd algorithm 
using only one extra memory chunk of size (n/t) 2 . The complexity is changed 
to R t {n) = t 2 tW acc (n/t), which is R t (n) = 6t 3 - log ^ n lo ^(7) _ 4^2 for an 
accumulation product with Winograd's algorithm. Using the parameter t, one 
can then balance the memory usage and the extra arithmetic operations. For 
example, with t = 2, 

n 2 

R 2 = 6.857n log 2 7 - 8n 2 and ExtraMem = — 

4 

and with t — 3, 

n 2 

R 3 = 7.414n log2 7 - 12n 2 and ExtraMem = —. 

Note that one can use the algorithm of table 9 instead of the classical Wino- 
grad accumulation as the base case algorithm. Then the memory overhead drops 
down to and the arithmetic complexity increases to R t (n)+t 2 ^ l ° e2 ^n l ° S2 ^ — 
tn 2 . 
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8 Conclusion 



With constant input matrices, we reduced the number of extra memory alloca- 
tions for the operation C <— a A x B + f3C from n 2 to |n 2 , by introducing two 
extra pre- additions. As shown below, the overhead induced by these supple- 
mentary additions is amortized by the gains in number of memory allocations. 

If the input matrices can be overwritten, we proposed a fully in-place sched- 
ule for the operation C <— AxB without any extra operations . We also proposed 
variants for the operation C <— A x B, where only one of the input matrices 
is being overwritten and one temporary is required. These subroutines allow 
us to reduce the extra memory allocations required for the C <— aA x B + fiC 
operation without overwrite: the extra required temporary space drops from n 2 
to only |n 2 , at a negligible cost. 

Some algorithms with an even more reduced memory usage, but with some 
increase in arithmetic complexity, are also shown. Table 10 gives a summary 
of the features of each schedule that has been presented. The complexities are 
given only for m = k = n being a power of 2. 

Theorem 3. The arithmetic and memory complexities of table 10 are correct. 

Proof. For the operation AxB, the arithmetic complexity of the schedule of 
table 1 classically satisfies 

fT^ 1 (n)=7T^ 1 (§) + 15(f) 2 
\Wi(l)=l 

so that Wi(n) = 6n los ^ 7 ) - 5n 2 . 
The schedule of table 1 requires 

fM 1 (n)=2(f) 2 + M 1 (f) 
\Mi(l)=0 

extra memory space, which is M\{n) = §n 2 . Its total number of allocations 

satisfies A x {n) = 2 (§) 2 + 7A 1 (f ) which is A^n) = §(n lo ^( 7 ) - n 2 ). 

The schedule of table 4 requires Mi(n) — (f ) 2 + M4 (f ) extra memory 
space, which is M^n) = \n 2 . Its total number of allocations satisfies A^n) = 

(f ) 2 + 4A 4 (f) which is A 4 {n) = \n 2 log 2 (n). 

The schedule of table 5 requires the same amount of arithmetic operations 
or memory. 

For A x B + f3C, the arithmetic complexity of [10] satisfies 
W 2 (n) = 5W 2 Q)+2W 1 (^)+u(^)\ 

hence W2(n) = 6n l ° S2 ^ — 4n 2 ; its memory overhead satisfies M 2 (n) = 3 (f) 2 + 
Mi (§) , which is Mi(n) = n 2 ; its total number of allocations satisfies A 2 (n) = 
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total 










Algorithm 


Input matrices 


# of extra 
temporaries 


extra 
mem- 
ory 


total # of extra 
allocations 


arithmetic co 


mplexity 


cq 

X 


Table 1 [6] 


Constant 


2 


h 2 


|(n 2 ' 807 -n 2 ) 


6n 2 ' 807 - 


5n 2 


Table 3 


Both Overwritten 











6n 2 - 807 - 


5n 2 


Table 4 or 5 


A or B Overwritten 


1 


¥ 2 


±n 2 log 2 (n) 


6n 2 ' 807 - 


5n 2 




7.1 


Constant 











7.2n 2 ' 807 - 


- 13n 2 


O 


Table 2 [10] 


Constant 


3 


n 2 


| n log2(7) +n log 2 (5)_ 

§n 2 


6n 2 - 807 - 


4n 2 


+ 


Table 6 


Both Overwritten 


2 


2 2 

¥ 


in 2 log 2 (n) 


6n 2 ' 807 - 4n 2 + 


|n 2 log 2 (n) 




Table 7 


B Overwritten 


2 


2n 2.322 _ 2n 2 


6n 2 ' 807 - 4n 2 + 


in 2 log 2 (n) 


X 
3 


Table 9 


Constant 


2 


k 


2n 2 ' 807 + 2n 2 - 322 -f n 2 


6n 2-807 _ 4n 2 + 


|n 2 log 2 (n) 


7.2 


Constant 


N/A 




6.857n 2 ' 807 


- 8n 2 




7.2 


Constant 


N/A 




|n 2 


7.414n 2 ' 807 


- 12n 2 



Table 10: Complexities of the schedules presented for square matrix multiplication 



3 (§) 2 + 5A 2 (§) + 2A 1 (§) , which is 

A 2 (n) = \n i°8 2 (7) +n io g2 (5)_ 5 2 

The arithmetic complexity of the schedule of table 6 satisfies 

W 6 (n) = AW e ^)+3W 1 Q+ 17 

so that Wq(ti) = 6n loS2 ' 7 ^ — An 2 + |n 2 log 2 (n); its number of extra memory sat- 
isfies M 6 (n) = 2 (|) 2 + M 6 (§) , which is M 6 (n) = |n 2 ; its total number of allo- 
cations satisfies A 6 (n) = 2 (f ) + 4A 6 (^) , which is A 6 (n) = n 2 + \r? log 2 (n). 
The arithmetic complexity of table 7 schedule satisfies 

W 7 (n) = AW 7 (2) + ^ (2) + 2^ 5 (=) + 16 (|) 2 , 

so that W 7 (n) = 6n log2 ( 7 ' - 4n 2 + in 2 log 2 (n); its number of extra memory 
satisfies M 7 (n) = 2 (|) 2 + M 7 (|) , which is M 7 (n) = |n 2 ; its total number 
of allocations satisfies A 7 (n) = 2 (|) 2 + AA 7 (§) + 2A 5 (|) , which is A 7 (n) = 
2 n iog 2 (5) „ 2n 2 . 

The arithmetic complexity of the schedule of table 9 satisfies 
W 9 (n) = AW, (2) + ^ (2) + 2^ 6 (2) + 17 Q) 2 , 

so that W 9 (n) = 6n log 2( 7 ) - 4n 2 + fn 2 (log 2 (n) - ^) + |; its number of extra 
memory satisfies M 9 (n) = 2 (§) 2 + M 9 (|) , which is M 9 (n) = |n 2 ; its total 
number of allocations satisfies A g (n) = 2 (f ) 2 + 4A 9 (f ) + A 1 (f ) + 2A 6 (f ) , 
which is A 9 (n) = |n lo S2(7) + 2n lo S2( 5 ) - 32 n 2 + |. " " □ 

For instance, by adding up allocations and arithmetic operations in table 10, 
one sees that the overhead in arithmetic operations of the schedule of table 9 
is somehow amortized by the decrease of memory allocations. Thus it makes it 
theoretically competitive with the algorithm of [10] as soon as n > 44. 

Also, problems with dimensions that are not powers of two can be handled by 
combining the cuttings of algorithms 1 and 2 with peeling or padding techniques. 
Moreover, some cut-off can be set in order to stop the recursion and switch to 
the classical algorithm. The use of these cut-offs will in general decrease both 
the extra memory requirements and the arithmetic complexity overhead. 

For instance we show on table 11 the relative speed of different multiplication 
procedures for some double floating point rectangular matrices. We use atlas- 
3.9.4 for the BLAS and a cut-off of 1024. We see that pour new schedules 
perform quite competitively with the previous ones and that the savings in 
memory enable larger computations (MT for memory thrashing). 
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Dims, (m, k, n) 


Classic 


[6] 


IPMM 


IPOvMM 


(4096,4096,4096) 


14.03 


11.93 


13.59 


11.98 


(4096,8192,4096) 


28.29 


23.39 


27.16 


23.88 


(8192,8192,8192) 


113.07 


85.97 


98.75 


85.02 


(8192,16384,8192) 


231.86 


MT 


197.24 


170.72 



Table 11: Rectangular matrix multiplication: computation time in seconds on 
a core2 duo, 3.00GHz, 2x2Gb RAM 
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